Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow regexes to contain ␀ byte #713

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

lluchs
Copy link

@lluchs lluchs commented Jun 19, 2018

As previously noted in #359, vis has some trouble with searching in binary files. I noticed that the optional TRE library does actually support NUL bytes in the regex, and started modifying vis to make use of that.

With this patch, it's possible to search for NUL bytes by yanking such a pattern into the "/ register and then using n/N. It's not possible yet to enter a pattern containing NUL bytes directly, as the command line works with zero-terminated strings. Actually entering the pattern works fine, but it discards everything after the NUL byte. I haven't figured out how to change that yet.

@mcepl
Copy link
Contributor

mcepl commented May 28, 2024

@rnpnr I really don’t know what to do with this one. Do we really care about binary files in vis?

@rnpnr
Copy link
Collaborator

rnpnr commented May 30, 2024

Do we really care about binary files in vis?

I'm not sure. It definitely doesn't really work right now. As for this patch its not really intrusive and any time I can use strings with a length instead of the NUL terminated mistake I prefer it.

Actually entering the pattern works fine, but it discards everything after the NUL byte. I haven't figured out how to change that yet.

I think you probably need to use tre_regnexec() instead of tre_regexec() when you actually go to perform the match. That also means that the length needs to be specified but that shouldn't be a problem. I would also prefer that this patch was modified so that tre_regncomp() is always used in text_regex_compile() and therefore the length is always required.

@lluchs if you are still interested in this I will look at any updates otherwise someone else can take this over. I don't think any of these changes should be too complicated.

@lluchs
Copy link
Author

lluchs commented Jun 14, 2024

Thanks for having a look at this PR! I rebased it to current master.

I think you probably need to use tre_regnexec() instead of tre_regexec() when you actually go to perform the match.

text_regex_match is actually only used exactly once for a null-terminated string here:

vis/sam.c

Line 1590 in a7aac10

(w->file->name && text_regex_match(cmd->regex, w->file->name, 0) == 0);

The actual matching in the text is already null-byte-safe (see str_next_char).

I would also prefer that this patch was modified so that tre_regncomp() is always used in text_regex_compile() and therefore the length is always required.

I did that now and managed to make / commands with null bytes work. With that, I think everything works correctly now.

This allows entering patterns containing a null byte with /.
@@ -1590,8 +1589,7 @@ Regex *vis_regex(Vis *vis, const char *pattern) {
text_regex_free(regex);
return NULL;
}
if (len == 0)
register_put0(vis, &vis->registers[VIS_REG_SEARCH], pattern);
register_put(vis, &vis->registers[VIS_REG_SEARCH], pattern, len+1);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The +1 here is not very intuitive. For some reason, register_get gives us a length without the terminating null byte (which is still part of the buffer), but register_put expects a length including the null.

@lluchs lluchs changed the title WIP: Allow regexes to contain ␀ byte Allow regexes to contain ␀ byte Jun 14, 2024
vis.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@rnpnr rnpnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Lukas,
After reviewing again I think this looks good. I just left one other comment and then this can be merged!

vis-motions.c Outdated
Comment on lines 19 to 22
Regex *regex = vis_regex(vis, expr);
Regex *regex = vis_regex(vis, expr, strlen(expr));
if (!regex) {
snprintf(expr, sizeof(expr), "\\<%s\\>", buf);
regex = vis_regex(vis, expr);
regex = vis_regex(vis, expr, strlen(expr));
Copy link
Collaborator

@rnpnr rnpnr Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strlens can be avoided by saving the return value from snprintf.
Edit: Check line 18 for the other snprintf; github messed up my response.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that now by removing the ~500 character limit on the word search. That seemed to be better than handling the case where snprintf truncates its output.

Words longer than ~500 bytes would previously result in a truncated
search pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants